ABSTRACT

Crowd flow segmentation is an important step in many video
surveillance tasks. In this work, we propose an algorithm for
segmenting flows in H.264 compressed videos in a completely
unsupervised manner. Our algorithm works on motion vectors, which can
be obtained by partially decoding the compressed video, without
extracting any additional features. Our approach is based on modelling
the motion vector field as a Conditional Random Field (CRF) and
obtaining oriented motion segments by finding the optimal labelling
which minimises the global energy of the CRF. These oriented motion
segments are recursively merged based on the gradient across their
boundaries to obtain the final flow segments. This work in the
compressed domain can be easily extended to the pixel domain by
substituting motion vectors with motion based features like optical
flow. The proposed algorithm is experimentally evaluated on a standard
crowd flow dataset, and its superior performance in both accuracy and
computational time is demonstrated through quantitative results.

Index Terms: Crowd Flow Segmentation, Conditional Random Fields,
H.264 Compressed Videos, Compressed Domain Processing

1. INTRODUCTION 

With video surveillance having become ubiquitous, enormous amounts of
video data are captured by cameras all around us. This has made it
next to impossible for security personnel or organisations to follow
and analyse these videos manually and make intelligent decisions.
Fortunately, research in computer vision is moving towards automating
this process. In the past decade, automated video surveillance has
become an important research topic in the field of computer vision.
Research in video surveillance involves tackling problems like
object/person detection, recognition, tracking, flow analysis and
anomaly detection.

Extracting the dominant flows present in a video forms an important
preliminary step for many video surveillance tasks. A flow in a video
can be defined as a dominant path along which there is significant
motion throughout the video. A video can have multiple flows, and
neither the number of flows nor the path of each flow is known a
priori. This makes the problem of flow segmentation challenging. In
this work, we propose an algorithm to perform flow segmentation from
videos stored in the H.264 compression format [1] in an unsupervised
manner. H.264 is a popular choice for video compression as it allows
high resolution videos to be stored and transferred at a relatively
low bandwidth. Our approach segments the flows in the video without
the need to completely decode the H.264 compressed video and without
extracting any features other than motion vectors. This avoids the
additional overhead of computing optical flow vectors from videos to
characterise flows and keeps the computational cost of flow
segmentation low.

Conditional Random Fields (CRF) [2], which have been used extensively
for vision research in the last two decades [3], [4], [5], are known
to work well for problems like image segmentation [6], [7]. We model
the problem of flow segmentation as an optimisation problem within the
framework of CRF.

The rest of the paper is organised as follows: Section 2 gives a
brief overview of the recent research in flow segmentation in both
compressed and pixel domains. Section 3 presents the proposed
algorithm and Section 4 discusses its experimental evaluation and
analysis. We conclude with a summary of the proposed method in
Section 5.

2. RELATED WORK 

In the recent past, quite a few novel approaches have been proposed
for crowd analysis in both the pixel and compressed domains. In this
section we discuss some of these approaches. Ali et al. [8] proposed a
Lagrangian dynamics based approach for segmentation and analysis of
crowd flow. Their approach involves generating a flow field and
propagating particles along it using numerical integration methods.
The space-time evolution of these particles is used to set up a Finite
Time Lyapunov Exponent field, which can capture the underlying
Lagrangian Coherent Structure (LCS) in the flow. The dynamics and
stability of the LCS reveal the various flow segments present in the
video.

Rodriguez et al. [9] proposed an algorithm for crowd analysis which
is primarily based on prior learning of behavioural patterns from a
large dataset of crowd videos. Crowd analysis is carried out by
matching patches from a given test video with those of the dataset and
by transferring the corresponding behavioural patterns.

Wu et al. [10] proposed a crowd motion partitioning algorithm based
on representing optical flow features in salient regions as a
scattered motion field. By initially making an approximation that the
local crowd motion is translational in nature, the authors develop a
Local-Translation Domain Segmentation (LTDS) model. They further
extend this to scattered motion fields to achieve crowd motion
partitioning.

The approaches discussed above work in the pixel domain and involve
extracting features like optical flow from the uncompressed video. In
the compressed domain, Gnana et al. [11] proposed a flow segmentation
algorithm for H.264 compressed videos using motion vectors. Their
approach involves detecting regions of interest in a video and
clustering motion vectors extracted from those locations using
Expectation Maximisation. The motion clusters are later merged to form
flows based on the Bhattacharyya distance between the histograms of
orientation of motion vectors at the boundaries of clusters.

Again for the H.264 compressed format, Biswas et al. [12] proposed a
segmentation algorithm for crowd flow based on superpixels. The mean
motion vectors are colour coded and superpixel segmentation is
performed at different scales. These segments, obtained at different
scales, are merged based on the boundary potential between superpixels
to obtain flow segments.

3. PROPOSED METHOD 

Our approach is based on formulating the flow segmentation problem as
a CRF optimisation problem using motion vectors as features. We assign
a motion vector to every 4x4 pixel block in the video by replicating
motion vectors obtained from the corresponding local macro-blocks.
This is to facilitate the construction of the CRF on a uniform image
grid. Following this, a mean motion vector field is generated by
temporally averaging the motion vectors at every spatial location in
the video across all frames. The magnitude and orientation components
of this mean motion vector field for a test video are shown in
Fig. 1(c) and (e) respectively. The task of crowd flow segmentation in
a video can be thought of as an image segmentation problem with the
image being the mean motion vector field. This field can be considered
as an image with two channels: magnitude and orientation of the 2D
motion vectors.
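The temporal averaging step can be sketched as follows. This is only an illustrative sketch: the input layout (a list of per-frame grids holding one `(dx, dy)` vector per 4x4 block) and the function name `mean_motion_field` are assumptions, and the sign of the orientation depends on the image axis convention, which the text does not specify.

```python
import math

def mean_motion_field(frames):
    """Temporally average motion vectors and split into two channels.

    frames: list of 2-D grids (hypothetical layout); each cell holds the
    (dx, dy) motion vector of one 4x4 block in that frame.
    Returns (magnitude, orientation) grids; orientation is in degrees
    in the range [-180, 180].
    """
    rows, cols = len(frames[0]), len(frames[0][0])
    n = len(frames)
    magnitude = [[0.0] * cols for _ in range(rows)]
    orientation = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Mean of the x and y components over all frames
            mx = sum(f[r][c][0] for f in frames) / n
            my = sum(f[r][c][1] for f in frames) / n
            magnitude[r][c] = math.hypot(mx, my)
            orientation[r][c] = math.degrees(math.atan2(my, mx))
    return magnitude, orientation
```

The two returned grids play the role of the two image channels on which the CRF is built.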

CRFs are undirected graphical models for structured prediction where
the global inference is made from locally defined clique potentials.
They have been used extensively for image segmentation in the last two
decades and have proved to be effective tools for this task.

A CRF is constructed on an image grid with the video's spatial
dimensions and with 4-neighbourhood connectivity. Here, each node in
the CRF corresponds to the spatial location of a 4x4 pixel block in
the video and is connected to its left, right, top and bottom nodes.
The mean motion vector corresponding to the spatial location of each
node in the CRF is taken as its feature. Let the motion vector feature
corresponding to a node at location $u$ be $f^u$, with magnitude
$f^u_m$ and orientation $f^u_\theta$. Let the label associated with
this node be $x_u$, where $x_u$ is a discrete random variable. This
CRF with the mean motion vector features is illustrated in Fig. 2(a).

Ideally, in this CRF formulation, each label should correspond to a
flow present in the video. But the number of flows as well as their
paths are unknown a priori. Hence the flow segmentation problem is
approached by initially segmenting the motion vector field based on
orientation. In this step, each orientation segment clusters motion
vectors lying along a specific direction. Later, these motion
orientation segments are merged together based on their proximity and
continuity to obtain coherent flow segments. Since the various motion
orientations present in the video are also unknown a priori, the
labels of the CRF are created to support all possible motion
orientations: -180° to 180° in steps of 10°. An additional label is
created to prune out the noisy motion vectors corresponding to the
background in the video. This background label supports motion vectors
with magnitude less than a certain threshold irrespective of their
orientation.

Fig. 1. (a) Frame from a test sequence (b) Segmentation result from
coarse CRF (c) Magnitude components of motion vector field
(d) Segmentation result from fine CRF (e) Orientation components of
motion vector field (f) Final flow segmentation result.

Specifically, for orientation based segmentation, the unary potential
of a node at location $u$ with feature $f^u$ and label $x_u$ is
defined as follows:

$$
\psi_u(x_u) =
\begin{cases}
0 & \text{if } x_u = 0 \;\&\; f^u_m < \tau \\
c_1 & \text{if } x_u = 0 \;\&\; f^u_m > \tau \\
c_2 & \text{if } x_u \neq 0 \;\&\; f^u_m < \tau \\
\angle(f^u_\theta, \theta^{x_u}) & \text{if } x_u \neq 0 \;\&\; f^u_m > \tau
\end{cases}
\tag{1}
$$

$$
\text{where } \angle(f^u_\theta, \theta^{x_u}) =
\min\left(|f^u_\theta - \theta^{x_u}|,\; 360 - |f^u_\theta - \theta^{x_u}|\right)
\tag{2}
$$

Here, the label $x_u = 0$ corresponds to the background and $\tau$ is
a soft threshold on the magnitude of motion vectors to determine if
they belong to the background. $c_1$, $c_2$ are constants determined
empirically. Other labels, $x_u \neq 0$, correspond to motion along
various orientations. $\theta^{x_u}$ is the orientation supported by
the label $x_u$ and takes one of the values among
{-170°, ..., 0°, ..., 170°, 180°}. $\angle(f^u_\theta, \theta^{x_u})$
denotes the angle between two vectors with orientations $f^u_\theta$,
$\theta^{x_u}$ and is computed as given in Eq. (2).

The pairwise potentials over the CRF are defined so as to ensure
smooth segmentation. This is done by assigning a pairwise cost between
neighbouring nodes which take different labels, proportional to the
similarity between their node features. Specifically, the pairwise
potential between two neighbouring nodes $u$ and $v$ is defined as
follows:

$$
\psi_{u,v}(x_u, x_v) =
\begin{cases}
0 & \text{if } x_u = x_v \\
c_3 \cdot \left(360 - \angle(f^u_\theta, f^v_\theta)\right) & \text{if } x_u \neq x_v
\end{cases}
\tag{3}
$$
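The potentials of Eq. (1) to Eq. (3) can be sketched in a few lines. This is a minimal sketch under stated assumptions: the function names are hypothetical, the label-to-orientation mapping follows Algorithm 1 (label i supports -180 + 10i degrees), a magnitude exactly equal to the threshold is treated as "above threshold", and the last branch of `unary` (a moving block with an orientation label pays the angular distance of Eq. (2)) is the reading suggested by the surrounding text. The constants passed in are placeholders, since the text only says they are determined empirically.

```python
BACKGROUND = 0
# Label i >= 1 supports orientation -180 + 10*i degrees, i.e. -170 ... 180
LABEL_ORIENTATION = {i: -180 + 10 * i for i in range(1, 37)}

def angle_between(a, b):
    """Eq. (2): angle in degrees between orientations a and b in [-180, 180]."""
    d = abs(a - b)
    return min(d, 360 - d)

def unary(label, magnitude, orientation, tau, c1, c2):
    """Eq. (1): cost of giving `label` to a node with the given feature."""
    if label == BACKGROUND:
        # Background is free for weak motion, penalised by c1 otherwise
        return 0.0 if magnitude < tau else c1
    if magnitude < tau:
        # Orientation label on a weak (likely background) vector
        return c2
    # Assumed fourth case: angular mismatch with the label's orientation
    return angle_between(orientation, LABEL_ORIENTATION[label])

def pairwise(label_u, label_v, orient_u, orient_v, c3):
    """Eq. (3): cost of a label change between two neighbouring nodes."""
    if label_u == label_v:
        return 0.0
    # Similar orientations make a label change more expensive
    return c3 * (360 - angle_between(orient_u, orient_v))
```

For example, with label 27 (orientation 90°) and a block moving at 95°, `unary` returns the 5° mismatch, so the labels closest to the observed direction are the cheapest.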


Algorithm 1 : Crowd Flow Segmentation
Require: Video: $V$
Ensure: Flow Segments: $\{F^0, \ldots, F^{N-1}\}$
  % Extract mean motion vector field from V
  $MV$ = MeanMotionVectors($V$)
  Labels: $0, 1, \ldots, K-1$
  % Label 0 corresponds to background & supports motion vectors
  % of magnitude less than a threshold
  % $\theta^i_{coarse}$ : Orientation supported by label i
  $\theta^i_{coarse} = -180 + i \cdot 10 \quad \forall i \in [1, K-1]$
  % Extract Coarse Orientation Segments
  % Unary and Pairwise costs are defined in Eq. (1) and Eq. (3)
  $\{S^0_{coarse}, \ldots, S^{L-1}_{coarse}\}$ = CRFoptimisation($MV$, $\theta_{coarse}$)
  $i = 1$
  for $l = 0 \to L-1$ do
    if $|S^l_{coarse}| > size_{thresh}$ then
      $\theta^i_{fine}$ = MeanOrientation($S^l_{coarse}$)
      $i = i + 1$
    end if
  end for
  % Extract Fine Orientation Segments
  $\{S^0_{fine}, \ldots, S^{M-1}_{fine}\}$ = CRFoptimisation($MV$, $\theta_{fine}$)
  % Extract Flow Segments
  $\{F^0, \ldots, F^{N-1}\}$ = Merge($S^0_{fine}, \ldots, S^{M-1}_{fine}$)

Fig. 2. Formulated CRF. (a) CRF with motion feature vectors (b) Label
orientations, coarse CRF (c) Label orientations, fine CRF.

With the unary and pairwise potentials as defined in Eq. (1) and
Eq. (3), the total energy of the CRF is the sum of unary and pairwise
terms:

$$
E(x) = \sum_{u} \psi_u(x_u) + \sum_{\substack{u,v \\ u \sim v}} \psi_{u,v}(x_u, x_v)
\tag{4}
$$

where $u \sim v$ ranges over pairs of neighbouring nodes.
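As a rough illustration of Eq. (4), the total energy of a given labelling on a 4-connected grid can be evaluated as below. The function name and the callback signatures are hypothetical, and counting every neighbouring pair exactly once is an assumption; the text does not say whether each edge is counted once or twice.

```python
def total_energy(labels, unary_cost, pairwise_cost):
    """Evaluate Eq. (4) for a labelling on a 4-connected grid.

    labels: 2-D list of integer labels, one per node.
    unary_cost(r, c, label) -> float
    pairwise_cost((r, c), (r2, c2), label_u, label_v) -> float
    Each neighbouring pair is visited once (right and down neighbours).
    """
    rows, cols = len(labels), len(labels[0])
    energy = 0.0
    for r in range(rows):
        for c in range(cols):
            energy += unary_cost(r, c, labels[r][c])
            if c + 1 < cols:  # right neighbour
                energy += pairwise_cost((r, c), (r, c + 1),
                                        labels[r][c], labels[r][c + 1])
            if r + 1 < rows:  # bottom neighbour
                energy += pairwise_cost((r, c), (r + 1, c),
                                        labels[r][c], labels[r + 1][c])
    return energy
```

Passing the costs as callbacks keeps the sketch independent of how the node features are stored.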

Solving the CRF thus formulated is equivalent to finding a labelling
$x^* = [\ldots, x^*_u, \ldots]$ which minimises the global energy
$E(x)$ defined in Eq. (4). The optimal labelling assigns a label to
each node in the image grid, thus assigning it to either a background
segment or a segment with a specific orientation. The oriented motion
segmentation result obtained is shown in Fig. 1(b).

Finding the exact solution for the minimum energy labelling problem
is NP-hard. In this work, an approximate solution for the CRF
labelling is found using the graph cuts based algorithm proposed in
[13], [14], [15], [16]. This algorithm converges quickly on grid
graphs to a local minimum by allowing large moves whenever possible.
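To make the idea of approximate energy minimisation concrete, the sketch below uses iterated conditional modes (ICM), a much simpler greedy method. It is explicitly not the graph cuts algorithm used in this work: ICM changes one node at a time and typically ends in poorer local minima than the large moves of graph cuts. The function name and callback signatures are hypothetical.

```python
def icm(num_labels, rows, cols, unary_cost, pairwise_cost, sweeps=10):
    """Greedy stand-in for graph cuts: iterated conditional modes.

    unary_cost(r, c, label) -> float
    pairwise_cost((r, c), (r2, c2), label_u, label_v) -> float
    Returns a 2-D list of labels at a local minimum of the energy.
    """
    # Initialise every node with its cheapest label under the unary term
    labels = [[min(range(num_labels), key=lambda k: unary_cost(r, c, k))
               for c in range(cols)] for r in range(rows)]
    for _ in range(sweeps):
        changed = False
        for r in range(rows):
            for c in range(cols):
                nbrs = [(r + dr, c + dc)
                        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                        if 0 <= r + dr < rows and 0 <= c + dc < cols]

                def local(k):
                    # Energy terms that depend on this node's label only
                    return unary_cost(r, c, k) + sum(
                        pairwise_cost((r, c), n, k, labels[n[0]][n[1]])
                        for n in nbrs)

                best = min(range(num_labels), key=local)
                if local(best) < local(labels[r][c]):
                    labels[r][c] = best
                    changed = True
        if not changed:  # converged: no single-node move lowers the energy
            break
    return labels
```

Each accepted move strictly lowers the energy, so the loop terminates; the quality of the result depends heavily on the initialisation.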

The motion segmentation so obtained is coarse and may not be very
accurate. This is because the orientations supported by the CRF labels
(-170°, -160°, ..., 180°) need not closely align with the actual
orientations present in the motion vector field. In order to further
refine this segmentation, we formulate a fine CRF. The labels for this
fine CRF are obtained by taking the mean orientation of the motion
vectors contained in each coarse segment. Here we consider only
segments whose size is greater than a certain threshold. This helps in
eliminating noisy segments. This fine CRF is solved with the same
unary and pairwise potentials as in Eq. (1) and Eq. (3), with
$\theta^{x_u}$ corresponding to the newly calculated orientations. The
label orientations corresponding to the coarse CRF and the fine CRF
are shown in Fig. 2(b) and (c) respectively. The refined motion
segmentation obtained after solving this fine CRF is shown in
Fig. 1(d).
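The derivation of the fine labels can be sketched as follows. Two points here are assumptions rather than details from the text: a coarse segment is taken to be all nodes sharing one coarse label, and the "mean orientation" is computed as a circular mean so that, for instance, 170° and -170° average to 180° instead of 0°. The function name is hypothetical.

```python
import math

def fine_label_orientations(coarse_labels, orientation, min_size):
    """Mean orientation (degrees) of each sufficiently large coarse segment.

    coarse_labels: 2-D list of coarse CRF labels (0 = background).
    orientation:   2-D list of orientations in degrees, same shape.
    min_size:      segments with at most this many nodes are dropped.
    """
    sums = {}  # label -> [sum of cos, sum of sin, node count]
    for row_l, row_o in zip(coarse_labels, orientation):
        for lab, theta in zip(row_l, row_o):
            if lab == 0:  # skip the background segment
                continue
            s = sums.setdefault(lab, [0.0, 0.0, 0])
            s[0] += math.cos(math.radians(theta))
            s[1] += math.sin(math.radians(theta))
            s[2] += 1
    # Circular mean per segment, keeping only segments above the size threshold
    return [math.degrees(math.atan2(sy, sx))
            for lab, (sx, sy, n) in sorted(sums.items()) if n > min_size]
```

The returned list would replace the fixed 10° grid of orientations when the fine CRF is solved.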

The final flow segmentation is obtained by appropriately merging the
refined oriented motion segments. For this purpose, we create a
gradient image of the orientation channel of the motion vector field.
For each pair of adjacent segments considered for merging, we compute
the mean gradient along the boundary joining them. If this mean
gradient is less than a certain threshold, the two segments are
merged. The entire algorithm is summarised in Algorithm 1. The final
flow segments obtained are shown in Fig. 1(f).
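A possible shape for the merging step is sketched below. Several details are assumptions, since the text does not fix them: the gradient image is taken as a precomputed input, the boundary between two segments is taken to be the pairs of 4-connected nodes that straddle it, and the pair with the smallest mean gradient is merged first. Function names are hypothetical.

```python
def boundary_means(labels, gradient):
    """Mean gradient along the shared boundary of each adjacent segment pair."""
    acc = {}
    rows, cols = len(labels), len(labels[0])
    for r in range(rows):
        for c in range(cols):
            for r2, c2 in ((r, c + 1), (r + 1, c)):
                if r2 < rows and c2 < cols:
                    a, b = labels[r][c], labels[r2][c2]
                    if a != b and a != 0 and b != 0:  # ignore background
                        key = (min(a, b), max(a, b))
                        total, count = acc.get(key, (0.0, 0))
                        acc[key] = (total + gradient[r][c] + gradient[r2][c2],
                                    count + 2)
    return {k: t / n for k, (t, n) in acc.items()}

def merge_segments(labels, gradient, threshold):
    """Repeatedly merge adjacent segments whose boundary gradient is low."""
    labels = [row[:] for row in labels]  # work on a copy
    while True:
        means = boundary_means(labels, gradient)
        candidates = [k for k, m in means.items() if m < threshold]
        if not candidates:
            return labels
        # Merge the most similar pair first; the larger id is absorbed
        keep, drop = min(candidates, key=means.get)
        labels = [[keep if x == drop else x for x in row] for row in labels]
```

Every merge removes one segment id, so the loop ends after at most one iteration per initial segment.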


4. EXPERIMENTS 


The proposed method is evaluated on the flow dataset provided by Ali
et al. [8]. The videos of this dataset have dense flows in both
traffic and crowd scenarios. Since these videos are not originally in
H.264 format, we have followed the same procedure as Biswas et al.
[12] for encoding. Specifically, the video is encoded into the H.264
baseline profile with only I and P frames. One reference frame is
used, with the Group of Pictures length set to 30. As mentioned in
[12], this baseline profile is ideal for extracting motion vectors
on-the-fly with low latency. The motion vectors extracted from the
encoded video can come from varying macro-block sizes (from 4x4 to
16x16). The motion vectors obtained from bigger macro-blocks are
replicated to their constituent 4x4 blocks to maintain grid uniformity
and facilitate comparison of results with [12].
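The replication of motion vectors onto a uniform 4x4 grid might look like the sketch below. The input format, a list of `(x, y, w, h, (dx, dy))` partitions in pixel units as a decoder might report them, is hypothetical, as is the choice of leaving blocks without a motion vector at `(0, 0)`.

```python
def replicate_to_4x4(partitions, width, height):
    """Spread each partition's motion vector over its 4x4 blocks.

    partitions: list of (x, y, w, h, (dx, dy)); x, y, w, h are in pixels
                and assumed to be multiples of 4 (4x4 up to 16x16).
    Returns a grid with height//4 rows and width//4 columns.
    """
    # Blocks not covered by any partition keep a zero vector (assumption)
    grid = [[(0, 0)] * (width // 4) for _ in range(height // 4)]
    for x, y, w, h, mv in partitions:
        for by in range(y // 4, (y + h) // 4):
            for bx in range(x // 4, (x + w) // 4):
                grid[by][bx] = mv
    return grid
```

A 16x8 partition, for example, fills a 2-row by 4-column patch of the grid with the same vector.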

The flow segments obtained using the proposed algorithm are
quantitatively evaluated by comparing against the ground-truth
segments using the Jaccard similarity measure. Let the ground-truth
segmentation be $A$ and the output of the proposed algorithm be $B$.
The Jaccard measure, which is the value of intersection over union,
for $A$ and $B$ can be computed as

$$
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
\tag{5}
$$
Fig. 3. Qualitative results for crowd flow segmentation. (More results at http://val.serc.iisc.emet.in/srinivas/CRFFlowSeg.html )


Table 1. Jaccard Similarity Measure with Ground Truth

Test Sequences    Ali et al. [8]    Biswas et al. [12]    Proposed
Sequence 1        0.63              0.60                  0.90
Sequence 2        0.28              0.67                  0.66
Sequence 3        0.57              0.74                  0.75
Sequence 4        0.67              0.68                  0.68
Sequence 5        0.78              0.24                  0.46
Sequence 6        0.41              0.62                  0.81
Sequence 7        0.60              0.15                  0.53

Table 2. Computational Time (in sec)

Video Sequences   Biswas et al. [12]    Proposed
Sequence 1        4.96                  0.20
Sequence 2        5.08                  0.31
Sequence 3        4.66                  0.23
Sequence 4        4.49                  0.33
Sequence 5        4.32                  0.08
Sequence 6        5.32                  0.31
Sequence 7        4.95                  0.38


Here the intersection represents the number of non-zero labelled 
pixel locations which match in labelling A and labelling B. The 
union represents the number of pixel locations which are assigned a 
non-zero label in either A or B or both. 
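The measure as described here can be sketched in a few lines. It assumes that segment ids in the two labellings have already been put in correspondence (the same id denotes the same flow in both), and returning 1.0 when both labellings are entirely background is a convention chosen for this sketch, not something stated in the text.

```python
def jaccard(a, b):
    """Jaccard similarity of two label grids, ignoring background (label 0).

    a, b: 2-D lists of integer labels with the same shape.
    """
    inter = union = 0
    for row_a, row_b in zip(a, b):
        for la, lb in zip(row_a, row_b):
            if la != 0 or lb != 0:   # non-zero in A, in B, or in both
                union += 1
                if la == lb:         # same non-zero label at this location
                    inter += 1
    return inter / union if union else 1.0
```

For two grids that agree on two of four non-background locations, the function returns 0.5.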

The quantitative and qualitative results are shown in Table 1 and
Fig. 3 respectively. The timing results presented in Table 2 are
based on experiments performed in MATLAB on a 3.4 GHz 64-bit Linux
system with 24GB RAM.

In Sequence 5, the frame size is 188 x 144 compared to 480 x 360 for
the other videos. Here the motion vectors could not capture the motion
accurately enough, resulting in poor performance. As long as the
motion is well captured, the proposed approach is shown to perform
better than or equivalent to [8], a pixel domain based approach.
Computationally, [8] takes around 30 sec for each sequence, which is
two orders of magnitude slower than the proposed method.


5. CONCLUSION 

In this work, we have proposed an algorithm for crowd flow
segmentation in the framework of CRFs. The node features for the CRF
are taken to be the motion vectors, and the unary and pairwise terms
are defined so as to obtain cluster segments corresponding to motion
along various orientations. Initially, we consider the labels for the
CRF to support all possible orientations in the 360° plane and later
refine them based on the orientations present in the video. The
refined orientation segments are recursively merged to obtain the
final flow segments. Our method can also be applied in the pixel
domain by just replacing the motion vectors with optical flow vectors.

One drawback of the proposed approach and other recent methods [11],
[12] is their inability to handle intersecting flows. This work can be
extended to segment time-varying flows by constructing a multi-modal
model at every spatial location as opposed to just the mean
statistics.